It is highly recommended to follow the theoretical session before starting this practical.

The aim of this session is to explore several methods and packages for computing the estimates usually required in Mendelian randomization (MR) studies.


R environment

In the following session, under R 4.2.1, we will use the following R packages:

Warning: loading {TwoSampleMR} leads to a conflict with {MendelianRandomization}, masking the mr_ivw and mr_median functions.
It is also worth noting the dat_to_MRInput function in the {TwoSampleMR} package, which converts data from the TwoSampleMR format to the MendelianRandomization format.
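As a sketch of that conversion (assuming a harmonised data frame dat produced by harmonise_data, as later in this session):

```r
# Sketch: convert TwoSampleMR-format data to MRInput objects
# (assumes `dat` is a harmonised data frame from harmonise_data(),
# as produced later in this session)
library(TwoSampleMR)
library(MendelianRandomization)

mr_in <- dat_to_MRInput(dat)  # returns a list of MRInput objects,
                              # one per exposure/outcome pair
MendelianRandomization::mr_ivw(mr_in[[1]])
```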


Dataset

Open the Rmd file MR_Practical_Session.Rmd to access the hidden chunk named dataset, then load the data by copying and pasting the objects (coursedata, ratio.all and diabetes_data) into your R console.

Warning: the dataset we gave you was already harmonised - all the associations were matched up to be on the same strand, and going in the same direction. In reality, you will often need to check that these alignments are correct before using the data.

The MR-Base web app can do this for you automatically under certain set rules (e.g. removing all palindromic SNPs with MAF >= 0.3), but we highly recommend doing this by hand if you are unfamiliar with the process and the choices being made.
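A minimal sketch of that hand-check, flagging palindromic SNPs whose minor allele frequency makes them impossible to align reliably (the data and the column names effect_allele / other_allele / eaf are illustrative, following the TwoSampleMR conventions):

```r
# Toy data: three SNPs with their alleles and effect allele frequency
snps <- data.frame(
  SNP           = c("rs1", "rs2", "rs3"),
  effect_allele = c("A", "A", "C"),
  other_allele  = c("G", "T", "G"),
  eaf           = c(0.12, 0.45, 0.28)
)
comp <- c(A = "T", T = "A", C = "G", G = "C")
# A SNP is palindromic when its two alleles are complementary (A/T or C/G)
palindromic <- snps$other_allele == unname(comp[snps$effect_allele])
maf <- pmin(snps$eaf, 1 - snps$eaf)
# The MR-Base rule above: drop palindromic SNPs with MAF >= 0.3
snps$SNP[palindromic & maf >= 0.3]
# "rs2"
```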

## load coursedata
## load ratio.all
## load diabetes_data

Ratio Method

To apply the ratio method, we will use the dataset coursedata (individual-level data), which provides phenotypes (x and y) and genotypes (the g columns).

head(coursedata)
##   ID g1 g2 g3 g4           x          y y.bin
## 1  1  1  0  0  1  0.51141400 -1.2970720     1
## 2  2  0  0  0  0 -1.05450500  0.9117613     0
## 3  3  1  1  0  1 -0.56296820  1.2791060     0
## 4  4  0  0  0  0  0.86149410 -1.4378610     1
## 5  5  0  0  0  1  0.01561589  2.7192830     0
## 6  6  2  0  0  0  0.68330030  1.6415770     0

Ratio estimates for variant j

\[\hat{\theta}_j = \frac{\hat{\beta}_{Y_j}}{\hat{\beta}_{X_j}}\]

Where j = 1 in this example.

# Genetic association with the outcome
by1 <- lm(formula = y ~ g1, data = coursedata)$coef[2]
# Genetic association with the exposure
bx1 <- lm(formula = x ~ g1, data = coursedata)$coef[2]
beta.ratio1 <- by1 / bx1
cat("Ratio estimate for g1 =", beta.ratio1)
## Ratio estimate for g1 = -0.06302634

Standard error of ratio estimate (first order)

\[se_{first}(\hat{\theta}_j) = \frac{se(\hat{\beta}_{Y_j})}{|\hat{\beta}_{X_j}|}\]

# Standard error of the G-Y association
byse1 <- summary(lm(formula = y ~ g1, data = coursedata))$coef[2, 2]
se.ratio1first <- byse1 / sqrt(bx1^2)
## sqrt(bx1^2) = |bx1| => a standard error must always be positive.
cat("Standard error (first order) of the ratio estimate g1 =", se.ratio1first)
## Standard error (first order) of the ratio estimate g1 = 0.6451987

Standard error of ratio estimate (second order)

\[se_{second}(\hat{\theta}_j) = \sqrt{\frac{se(\hat{\beta}_{Y_j})^2}{\hat{\beta}_{X_j}^2} + \frac{\hat{\beta}_{Y_j}^2 se(\hat{\beta}_{X_j})^2}{\hat{\beta}_{X_j}^4}}\]

# Standard error of the G-X association
bxse1 <- summary(lm(formula = x ~ g1, data = coursedata))$coef[2, 2]
se.ratio1second <- sqrt(byse1^2 / bx1^2 + by1^2 * bxse1^2 / bx1^4)
cat("Standard error (second order) of the ratio estimate  g1 =", se.ratio1second)
## Standard error (second order) of the ratio estimate  g1 = 0.6459625

F-statistic

The F-statistic from the regression of the risk factor on the genetic variant(s) is used as a measure of ‘weak instrument bias’, with smaller values suggesting that the estimate may suffer from weak instrument bias.
For instance, some studies recommend excluding genetic variants with an F-statistic less than 10.

fstat1 <- summary(lm(formula = x ~ g1, data = coursedata))$f[1]
cat("F-stat =", fstat1)
## F-stat = 4.027979
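As a sketch of that exclusion rule, using the per-variant F-statistics stored in the ratio.all dataset (values reproduced, rounded, from its printout later in this session):

```r
# Per-variant F-statistics from ratio.all (rounded)
fstat <- c(g1 = 4.03, g2 = 23.66, g3 = 11.73, g4 = 0.48)
# Keep variants that pass the F >= 10 rule of thumb
names(fstat)[fstat >= 10]
# "g2" "g3"
```

So g2 and g3 pass, while g1 and g4 would be flagged as weak instruments.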

Ratio estimates for a Binary Outcome

In a case-control setting, it is usual to regress the risk factor on the genetic variant in the controls only: the case-control sample is generally unrepresentative of the population, and if measurements of the risk factor are made post-disease, restricting to controls avoids possible reverse causation.

# logistic regression for gene-outcome association
by1.bin <- glm(formula = y.bin ~ g1, data = coursedata, family = binomial)$coef[2]
byse1.bin <- summary(glm(formula = y.bin ~ g1, data = coursedata, family = binomial))$coef[2, 2]
# linear regression in the controls only
# bx1.bin <- lm(coursedata$x[coursedata$y.bin==0]~coursedata$g1[coursedata$y.bin ==0])$coef[2]
bx1.bin <- lm(x ~ g1, data = coursedata[coursedata$y.bin == 0, ])$coef[2]
beta.ratio1.bin <- by1.bin / bx1.bin
# beta.ratio1.bin #ratio estimate for g1
cat("Ratio estimate for g1 with binary outcome =", beta.ratio1.bin, "\n")
## Ratio estimate for g1 with binary outcome = 0.2599639
se.ratio1.bin <- byse1.bin / bx1.bin
# se.ratio1.bin #standard error of the ratio estimate for g1
cat("Standard error of the ratio estimate for g1 with binary outcome =", se.ratio1.bin)
## Standard error of the ratio estimate for g1 with binary outcome = 0.5921551

Two-Stage Least Squares (TSLS) Method

With multiple genetic variants G as instrumental variables, TSLS (or 2SLS) is performed by:

\[X \sim G_1 + \dots + G_j\]

i.e. the risk factor is regressed on all the genetic variants in the same model, and the fitted values of the risk factor are stored. Then

\[Y \sim \hat{X}\]

i.e. the outcome is regressed on the fitted values of the risk factor.

Estimates

By hand, the computation can be made as follows (but this is not recommended!):

by.hand.fitted.values <- lm(x ~ g1 + g2 + g3 + g4, data = coursedata)$fitted
by.hand <- lm(coursedata$y ~ by.hand.fitted.values)
cat("TSLS estimate =", summary(by.hand)$coef[2], "\n")
## TSLS estimate = 0.570785
cat("TSLS standard error =", summary(by.hand)$coef[2, 2])
## TSLS standard error = 0.2016973

Performing two-stage least squares by hand is generally discouraged, as the standard error from the second-stage regression does not take into account the uncertainty in the first-stage regression. The R package {ivpack} (which depends on {AER} for ivreg) performs the two-stage least squares method using the ivreg function:

# library(ivpack) ## Depends AER for ivreg
ivmodel.all <- ivreg(y ~ x | g1 + g2 + g3 + g4, x = TRUE, data = coursedata)
# 2SLS estimates
cat("TSLS estimate =", summary(ivmodel.all)$coef[2], "\n")
## TSLS estimate = 0.570785
cat("TSLS standard error =", summary(ivmodel.all)$coef[2, 2])
## TSLS standard error = 0.2291837

F-statistic

cat("F-stat =", summary(lm(x ~ g1 + g2 + g3 + g4, data = coursedata))$f[1])
## F-stat = 10.5824

TSLS Estimates for a Binary Outcome with g1

With a binary outcome, the first-stage regression is conducted in the controls only, and the second stage is a logistic regression.

# values for g1 in the controls only
g1.con <- coursedata$g1[coursedata$y.bin == 0]
# values for the risk factor in the controls only
x.con <- coursedata$x[coursedata$y.bin == 0]
# Generate predicted values for all participants based on the linear regression in the controls only :
predict.con.g1 <- predict(lm(x.con ~ g1.con), newdata = list(g1.con = coursedata$g1))
# Fit a logistic regression model on all the participants
tsls1.con <- glm(coursedata$y.bin ~ predict.con.g1, family = binomial)
cat("TSLS estimate for a binary outcome =", summary(tsls1.con)$coef[2], "\n")
## TSLS estimate for a binary outcome = 0.2599639
cat("TSLS standard error for a binary outcome =", summary(tsls1.con)$coef[2, 2])
## TSLS standard error for a binary outcome = 0.5921551

Observation: with a single genetic variant as instrumental variable, the TSLS and ratio methods give identical estimates (beta.ratio1.bin equals summary(tsls1.con)$coef[2]).

TSLS Estimates for a Binary Outcome with all G

With multiple genetic variants as IVs, the TSLS estimate is a weighted average of the per-variant ratio estimates.
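To see this, a quick sketch computes the inverse-variance weighted average of the per-variant ratio estimates (values taken, rounded, from the ratio.all printout later in this session, with first-order weights); the result is close to the TSLS estimate of 0.571 obtained above:

```r
ratio <- c(-0.0630, 0.5195, 0.7994, 2.5322)  # beta.ratio row of ratio.all
se1st <- c(0.6452, 0.2683, 0.3787, 1.8641)   # se.ratio.first row
w <- 1 / se1st^2                             # inverse-variance weights
sum(w * ratio) / sum(w)
# approximately 0.568
```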

# g1.con was already defined above
g2.con <- coursedata$g2[coursedata$y.bin == 0] # values for g2 in the controls only
g3.con <- coursedata$g3[coursedata$y.bin == 0] # values for g3 in the controls only
g4.con <- coursedata$g4[coursedata$y.bin == 0] # values for g4 in the controls only
predict.con <- predict(
  lm(x.con ~ g1.con + g2.con + g3.con + g4.con), # Predicted values
  newdata = c(
    list(g1.con = coursedata$g1),
    list(g2.con = coursedata$g2),
    list(g3.con = coursedata$g3),
    list(g4.con = coursedata$g4)
  )
)
# Logistic regression
tsls1.con.all <- glm(coursedata$y.bin ~ predict.con, family = binomial)
cat("TSLS estimate for a binary outcome =", summary(tsls1.con.all)$coef[2], "\n")
## TSLS estimate for a binary outcome = 0.9410501
cat("TSLS standard error for a binary outcome =", summary(tsls1.con.all)$coef[2, 2])
## TSLS standard error for a binary outcome = 0.3753075
  • Questions ?

Does the ivreg function work for a binary outcome? E.g. ivmodel.all.bin <- ivreg(y.bin ~ x | g1 + g2 + g3 + g4, x = TRUE)

Answer: the function works with a binary trait, in the sense that it will run and give you an answer. But it does not treat the trait as special because it is binary: the estimate simply represents an absolute change in the outcome, as if it were a continuous variable. Hence, if you want an odds ratio or a relative risk estimate, we suggest using the two-stage method. Provided that the genetic instruments are strong, the over-precision due to not accounting for uncertainty in the first-stage regression is typically small, even negligible.


Inverse Variance Weighted (IVW) Method

We can use summarized data (the genetic associations with the risk factor and with the outcome, with their standard errors) to estimate the causal effect of the risk factor on the outcome via the inverse-variance weighted (IVW) method.

IVW Estimates (by hand)

To do so, we will use the dataset ratio.all,

ratio.all
##                           g1         g2         g3         g4
## by              -0.008550605  0.2565551  0.2778433 0.17188563
## byse             0.087532274  0.1325076  0.1316129 0.12653807
## bx               0.135667161  0.4938326  0.3475786 0.06788043
## bxse             0.067597577  0.1015321  0.1014761 0.09798360
## beta.ratio      -0.063026340  0.5195184  0.7993682 2.53218242
## se.ratio.first   0.645198685  0.2683249  0.3786566 1.86413185
## se.ratio.second  0.645962479  0.2888032  0.4447983 4.10305056
## fstat            4.027979155 23.6566409 11.7321775 0.47993491
## MAF              0.301500000  0.1010000  0.1030000 0.11150000
bx <- ratio.all["bx", ]
by <- ratio.all["by", ]
bxse <- ratio.all["bxse", ]
byse <- ratio.all["byse", ]

and apply the following formulas:

\[\hat{\theta}_{IVW} = \frac{\sum_{j=1}^{n} \hat{\theta}_j /var(\hat{\theta}_j)}{\sum_{j=1}^{n} 1 / var(\hat{\theta}_j)}\]

\[var(\hat{\theta}_j) = \frac{var(\hat{\beta}_{Y_j})}{\hat{\beta}^2_{X_j}} = \frac{\sigma^2_{Y_j}}{\hat{\beta}^2_{X_j}}\]

beta.ivw <- sum(bx * by * byse^-2) / sum(bx^2 * byse^-2)
cat("IVW estimate  = ", beta.ivw, "\n")
## IVW estimate  =  0.567561
se.ivw <- 1 / sqrt(sum(bx^2 * byse^-2))
cat("Standard error of the IVW estimate  = ", se.ivw)
## Standard error of the IVW estimate  =  0.2060492

Motivation for the inverse-variance weighted formula

In this section, we will provide three different methods that motivate the inverse-variance weighted formula defined above.

First, the IVW method can be motivated by the meta-analysis of the ratio estimates from the individual variants, using the first-order standard errors. Calculate the ratio estimates and first-order standard errors for the four genetic variants:

beta.ratio.all <- t(by / bx)
se.ratio.all <- t(byse / bx)

You can now perform an inverse-variance weighted meta-analysis using the metagen command from the {meta} package:

# library(meta)
cat("Perform a meta analyses with Beta and Se : \n")
## Perform a meta analyses with Beta and Se :
metagen(beta.ratio.all, se.ratio.all)
## Number of studies combined: k = 4
## 
##                                       95%-CI    z p-value
## Common effect model  0.5676 [0.1637; 0.9714] 2.75  0.0059
## Random effects model 0.5676 [0.1637; 0.9714] 2.75  0.0059
## 
## Quantifying heterogeneity:
##  tau^2 < 0.0001 [0.0000; 14.8943]; tau = 0.0008 [0.0000; 3.8593]
##  I^2 = 0.0% [0.0%; 84.7%]; H = 1.00 [1.00; 2.56]
## 
## Test of heterogeneity:
##     Q d.f. p-value
##  2.47    3  0.4802
## 
## Details on meta-analytical method:
## - Inverse variance method
## - Restricted maximum-likelihood estimator for tau^2
## - Q-Profile method for confidence interval of tau^2 and tau
cat(
  "Get TE (Estimate of treatment effect, e.g., log hazard ratio or risk difference) = ",
  metagen(beta.ratio.all, se.ratio.all)$TE.fixed, " \n"
)
## Get TE (Estimate of treatment effect, e.g., log hazard ratio or risk difference) =  0.567561
cat(
  "Get seTE.fixed (Standard error of treatment estimate) = ",
  metagen(beta.ratio.all, se.ratio.all)$seTE.fixed, " \n"
)
## Get seTE.fixed (Standard error of treatment estimate) =  0.2060492

The fixed-effect estimate and standard error from the metagen command are identical to the IVW estimate and standard error.
A fixed-effect analysis may not be appropriate if there is heterogeneity between the causal estimates.

Secondly, the IVW method can also be motivated as a ratio estimate using a weighted allele score as an instrumental variable. We use the estimated genetic associations with the risk factor to create the allele score:

score <- coursedata$g1 * as.numeric(bx[1]) + coursedata$g2 * as.numeric(bx[2]) + coursedata$g3 * as.numeric(bx[3]) + coursedata$g4 * as.numeric(bx[4])

To calculate the ratio estimate and its (first-order) standard error using this score as an instrumental variable:

ivmodel.score <- ivreg(coursedata$y ~ coursedata$x | score, x = TRUE)
cat("Get the Instrumental-Variable Regression estimates :\n")
## Get the Instrumental-Variable Regression estimates :
summary(ivmodel.score)
## 
## Call:
## ivreg(formula = coursedata$y ~ coursedata$x | score, x = TRUE)
## 
## Residuals:
##        Min         1Q     Median         3Q        Max 
## -6.2324383 -1.3661564 -0.0009812  1.4010821  7.1175494 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)  
## (Intercept)  -0.06953    0.11435  -0.608   0.5433  
## coursedata$x  0.56732    0.22897   2.478   0.0134 *
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.031 on 998 degrees of freedom
## Multiple R-Squared: -0.278,  Adjusted R-squared: -0.2792 
## Wald test: 6.139 on 1 and 998 DF,  p-value: 0.01339

The results of the above model are very similar to those from using the genetic variants as the instrument.
The difference between the allele score and the fitted values is approximately the intercept term from the regression of the risk factor on all the genetic variants.

Thirdly, the IVW method can be motivated as a weighted linear regression of the genetic association estimates, with the intercept set to zero. Use the following code to fit the weighted linear regression model and obtain the causal estimate:

# rotates data to a column vector
BY <- t(by)
BX <- t(bx)
BYSE <- t(byse)
BXSE <- t(bxse)

regression <- lm(BY ~ BX - 1, weights = BYSE^-2)
cat("Get the weighted linear regression estimates :\n")
## Get the weighted linear regression estimates :
summary(regression)
## 
## Call:
## lm(formula = BY ~ BX - 1, weights = BYSE^-2)
## 
## Weighted Residuals:
##      g1      g2      g3      g4 
## -0.9774 -0.1790  0.6122  1.0539 
## 
## Coefficients:
##    Estimate Std. Error t value Pr(>|t|)  
## BX   0.5676     0.1871   3.034   0.0561 .
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.9079 on 3 degrees of freedom
## Multiple R-squared:  0.7542, Adjusted R-squared:  0.6723 
## F-statistic: 9.205 on 1 and 3 DF,  p-value: 0.05613
cat("Get the estimates = ", summary(regression)$coef[1, 1], "\n")
## Get the estimates =  0.567561
cat(
  "And divide the standard error by the sigma = ",
  summary(regression)$coef[1, 2] / summary(regression)$sigma
)
## And divide the standard error by the sigma =  0.2060492
  • Questions ?

Why do we divide the standard error by the sigma quantity in the final line of code?

Answer: we want to avoid underdispersion (the estimate being more precise than from a fixed-effect meta-analysis). If sigma, which is the standard deviation of the variant-specific estimates, is less than 1, then the dispersion of the estimates is lower than one would expect due to chance alone (based on the precision of the estimates). If sigma had been greater than 1, the variant-specific estimates would be more heterogeneous than one would expect due to chance alone. We want to allow overdispersion to make the confidence intervals wider, but not allow underdispersion to make them narrower. So the standard error of the causal estimate is divided by sigma (the estimate of the residual standard error) to force the residual standard error to be at least one.
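A sketch of that rule, using the numbers from the weighted regression output above (coefficient SE 0.187, residual standard error 0.908): dividing by min(sigma, 1) inflates the SE when the estimates are underdispersed, but leaves an overdispersed SE untouched.

```r
se <- 0.187     # coefficient SE reported by lm()
sigma <- 0.908  # residual standard error of the weighted regression
se / min(sigma, 1)  # corrected SE, approximately 0.206
# had sigma been > 1, the SE would have been left as-is
```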

N.B.: running a weighted regression lm(betaY ~ betaX - 1, weights = 1/betaYse^2) is actually equivalent to using mr_input and mr_ivw from the R package {MendelianRandomization}: exactly the same effect, standard error and p-value.
But lm also reports an R-squared, which tells us how much of the variation in the genetic associations with the outcome can be explained by the genetic associations with the risk factor.

IVW method using the {MendelianRandomization} package

The {MendelianRandomization} package was specifically written to implement Mendelian randomization analyses using summarized data.

IVW Estimates (by MendelianRandomization)

Before using any of the estimation functions, we have to define an MRInput object to ensure that the data are correctly formatted.

bx.all <- ratio.all["bx", ]
by.all <- ratio.all["by", ]
bxse.all <- ratio.all["bxse", ]
byse.all <- ratio.all["byse", ]
MRObject <- mr_input(
  bx = as.numeric(bx.all),
  bxse = as.numeric(bxse.all),
  by = as.numeric(by.all),
  byse = as.numeric(byse.all)
)
cat("Run an inverse-variance weighted (IVW) model \n")
## Run an inverse-variance weighted (IVW) model
mr_ivw(MRObject) ## inverse-variance weighted (IVW) estimate
## 
## Inverse-variance weighted method
## (variants uncorrelated, random-effect model)
## 
## Number of Variants : 4 
## 
## ------------------------------------------------------------------
##  Method Estimate Std Error 95% CI       p-value
##     IVW    0.568     0.206 0.164, 0.971   0.006
## ------------------------------------------------------------------
## Residual standard error =  0.908 
## Residual standard error is set to 1 in calculation of confidence interval when its estimate is less than 1.
## Heterogeneity test statistic (Cochran's Q) = 2.4728 on 3 degrees of freedom, (p-value = 0.4802). I^2 = 0.0%.
## Or in a single line of code as
# mr_ivw(mr_input(bx = bx.all, bxse = bxse.all, by = by.all, byse = byse.all))

The estimates above are all the same.

Comparing fixed-effects and random effects

To understand the difference obtained by setting a fixed or random effect in mr_ivw, we will use the dataset diabetes_data, which provides betas and standard errors for several SNPs.

head(diabetes_data) 
##   X        SNP beta.exposure beta.outcome se.exposure se.outcome
## 1 1 rs10184004   -0.00340017      -0.0121 0.000530106     0.0165
## 2 2 rs10487796   -0.00364702      -0.0237 0.000523323     0.0165
## 3 3 rs10748582   -0.00471516      -0.0029 0.000536941     0.0164
## 4 4 rs10830963    0.00441241       0.0105 0.000582103     0.0201
## 5 5 rs10965247   -0.00718076      -0.0246 0.000682275     0.0223
## 6 6 rs11671304    0.00319588      -0.0094 0.000558102     0.0184
MRObject2 <- mr_input(
  bx = diabetes_data$beta.exposure, bxse = diabetes_data$se.exposure,
  by = diabetes_data$beta.outcome, byse = diabetes_data$se.outcome
)

The default option is a fixed-effect analysis with 3 or fewer variants, and a random-effects analysis with 4 or more. A fixed-effect analysis can be forced using:

mr_ivw(MRObject2, model = "fixed")
## 
## Inverse-variance weighted method
## (variants uncorrelated, fixed-effect model)
## 
## Number of Variants : 38 
## 
## ------------------------------------------------------------------
##  Method Estimate Std Error  95% CI       p-value
##     IVW    0.660     0.591 -0.498, 1.818   0.264
## ------------------------------------------------------------------
## Residual standard error =  1.168 
## Residual standard error is set to 1 in calculation of confidence interval by fixed-effect assumption.
## Heterogeneity test statistic (Cochran's Q) = 50.4475 on 37 degrees of freedom, (p-value = 0.0692). I^2 = 26.7%.

Note: fixed effect => assumes no pleiotropic effect.

A random effect analysis can be forced using:

mr_ivw(MRObject2, model = "random")
## 
## Inverse-variance weighted method
## (variants uncorrelated, random-effect model)
## 
## Number of Variants : 38 
## 
## ------------------------------------------------------------------
##  Method Estimate Std Error  95% CI       p-value
##     IVW    0.660     0.690 -0.692, 2.012   0.339
## ------------------------------------------------------------------
## Residual standard error =  1.168 
## Heterogeneity test statistic (Cochran's Q) = 50.4475 on 37 degrees of freedom, (p-value = 0.0692). I^2 = 26.7%.

Note: random effects => pleiotropic effects are allowed, provided that on average they behave as balanced background noise.

The confidence intervals are much larger with random effects.

Note : In a meta-analysis, a fixed-effect analysis means that all the studies estimate the same quantity. A random-effect analysis means that the studies can estimate different quantities, and the result is the average estimate. Formally, we usually assume that the study-specific estimates are normally distributed about a mean value with a given variance.
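A quick sketch of that difference with {meta} on simulated estimates (the numbers are illustrative): with heterogeneous study-specific estimates, the random-effects standard error is inflated by the between-estimate variance tau^2.

```r
library(meta)
est <- c(0.2, 0.9, 0.1, 1.1, 0.5)  # heterogeneous study-specific estimates
se  <- rep(0.15, 5)
m <- metagen(est, se)
c(fixed = m$seTE.fixed, random = m$seTE.random)
# the random-effects SE is larger, reflecting tau^2 > 0
```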

IVW method using the {TwoSampleMR} package

Warning: loading both the {MendelianRandomization} and the {TwoSampleMR} packages can cause errors, as they both include functions called mr_ivw. We recommend you stick to a single package within any given analysis.

Note: recent changes to the googleAuthR package have broken the way the {TwoSampleMR} package authenticates; in order to run this exercise as intended, we need to install an earlier version of the package. If you are getting errors, use this install command: install_github('MarkEdmondson1234/googleAuthR@v0.8.1')

While MR-Base can be used on your own data, one of its main strengths is the speed with which data for exposures and outcomes can be found and combined. Here we download two chosen datasets from the MR-Base website.

IVW Estimates (by TwoSampleMR)

library(TwoSampleMR)
# The following objects are masked from 'package:MendelianRandomization':
#     mr_ivw, mr_median

# this downloads instruments suitable for Diabetes
exposure_dat <- extract_instruments(c("ukb-b-12948"))

# this finds the data for same snps, but with outcome as Ischemic stroke
outcome_dat <- extract_outcome_data(
  exposure_dat$SNP, c("ieu-a-1108"),
  proxies = 1, rsq = 0.8, align_alleles = 1,
  palindromes = 1, maf_threshold = 0.3
)

# this combines the two data sets,
# automatically aligning snps
# BUT removing any snp that cannot be placed into matching alignment (!)
dat <- harmonise_data(exposure_dat, outcome_dat, action = 2)

# this fits the fixed and random effect ivw models
mr_results <- mr(dat, method_list = c("mr_ivw_mre", "mr_ivw_fe"))

# Note if you need to change back to the MendelianRandomization package
detach("package:TwoSampleMR", unload = TRUE)

MR-Egger and median-based methods using {MendelianRandomization} package

For this section of the practical, we use the example data provided in the {MendelianRandomization} package on the associations (beta-coefficients and standard errors) of 28 genetic variants with three risk factors: LDL-cholesterol (ldlc and ldlcse), HDL-cholesterol (hdlc and hdlcse), and triglycerides (trig and trigse); and with one outcome: coronary heart disease (chdlodds and chdloddsse). The associations with coronary heart disease (CHD) risk are log odds ratios (beta-coefficients from a logistic regression analysis).
The summary data for the risk factors and outcome were taken from the paper by Waterworth et al (2011) Genetic variants influencing circulating lipid levels and risk of coronary artery disease, doi: 10.1161/atvbaha.109.201020.
The 28 genetic variants in the example dataset reached genome-wide level of significance (p-value < 5 x 10^-8) with at least one of the three risk factors (LDL-cholesterol, HDL-cholesterol, and triglycerides) in the paper by Waterworth et al (2011).

Various Estimates

The MR-Egger method can be implemented with as few as 3 variants; however, it will generally give imprecise estimates unless there is a good handful of variants. Hence we use the 28-variant dataset described above.

library(MendelianRandomization)

# creates mr_input objects
MR_LDLObject <- mr_input(bx = ldlc, bxse = ldlcse, by = chdlodds, byse = chdloddsse)
MR_HDLObject <- mr_input(bx = hdlc, bxse = hdlcse, by = chdlodds, byse = chdloddsse)
# fit some egger and median models
cat("Run the MR Egger model : \n")
## Run the MR Egger model :
mr_egger(MR_LDLObject)
## 
## MR-Egger method
## (variants uncorrelated, random-effect model)
## 
## Number of Variants =  28 
## 
## ------------------------------------------------------------------
##       Method Estimate Std Error  95% CI       p-value
##     MR-Egger    3.253     0.770  1.743, 4.762   0.000
##  (intercept)   -0.011     0.015 -0.041, 0.018   0.451
## ------------------------------------------------------------------
## Residual Standard Error :  1.935 
## Heterogeneity test statistic = 97.3975 on 26 degrees of freedom, (p-value = 0.0000)
## I^2_GX statistic: 91.9%
cat("Run the MR median-based method weighted : \n")
## Run the MR median-based method weighted :
mr_median(MR_LDLObject, weighting = "weighted")
## 
##  Weighted median method 
## 
## Number of Variants : 28 
## ------------------------------------------------------------------
##                  Method Estimate Std Error 95% CI       p-value
##  Weighted median method    2.683     0.419 1.862, 3.504   0.000
## ------------------------------------------------------------------
cat("Run the MR median-based method simple : \n")
## Run the MR median-based method simple :
mr_median(MR_LDLObject, weighting = "simple")
## 
##  Simple median method 
## 
## Number of Variants : 28 
## ------------------------------------------------------------------
##                Method Estimate Std Error 95% CI       p-value
##  Simple median method    1.755     0.740 0.305, 3.205   0.018
## ------------------------------------------------------------------
# graph mr models
cat("Run the all MR method : \n")
## Run the all MR method :
mr_allmethods(MR_LDLObject)
##                     Method Estimate Std Error 95% CI        P-value
##              Simple median    1.755     0.740   0.305 3.205   0.018
##            Weighted median    2.683     0.419   1.862 3.504   0.000
##  Penalized weighted median    2.681     0.420   1.857 3.505   0.000
##                                                                    
##                        IVW    2.834     0.530   1.796 3.873   0.000
##              Penalized IVW    2.561     0.413   1.752 3.370   0.000
##                 Robust IVW    2.797     0.307   2.195 3.399   0.000
##       Penalized robust IVW    2.576     0.251   2.083 3.069   0.000
##                                                                    
##                   MR-Egger    3.253     0.770   1.743 4.762   0.000
##                (intercept)   -0.011     0.015  -0.041 0.018   0.451
##         Penalized MR-Egger    3.421     0.531   2.380 4.461   0.000
##                (intercept)   -0.022     0.011  -0.043 0.000   0.051
##            Robust MR-Egger    3.256     0.624   2.033 4.479   0.000
##                (intercept)   -0.015     0.021  -0.055 0.026   0.474
##  Penalized robust MR-Egger    3.502     0.478   2.566 4.438   0.000
##                (intercept)   -0.026     0.014  -0.054 0.003   0.075
mr_allmethods(MR_LDLObject, method = "main")
##           Method Estimate Std Error 95% CI        P-value
##    Simple median    1.755     0.740   0.305 3.205   0.018
##  Weighted median    2.683     0.419   1.862 3.504   0.000
##              IVW    2.834     0.530   1.796 3.873   0.000
##         MR-Egger    3.253     0.770   1.743 4.762   0.000
##      (intercept)   -0.011     0.015  -0.041 0.018   0.451

Graphics

Scatter plot of genetic associations with the outcome (vertical axis) \(\hat{B_Y}\) against genetic associations with the exposure (horizontal axis) \(\hat{B_X}\).

The lines on each point represent the 95% confidence intervals for the genetic associations with the exposure and with the outcome.

The ratio estimate for each variant is the gradient of the line from the origin to the data point.

The slope of the regression line combines the n variant-specific estimates \(\hat{\theta}_j\) into a single causal estimate.

Representation by hand

For instance, using the estimates from the weighted linear regression:

plot(
  x = BX, y = BY,
  xlim = c(min(BX - 2 * BXSE, 0), max(BX + 2 * BXSE, 0)),
  ylim = c(min(BY - 2 * BYSE, 0), max(BY + 2 * BYSE, 0))
)
for (j in 1:length(BX)) {
  lines(x = c(BX[j], BX[j]), y = c(BY[j] - 1.96 * BYSE[j], BY[j] + 1.96 * BYSE[j]))
  lines(x = c(BX[j] - 1.96 * BXSE[j], BX[j] + 1.96 * BXSE[j]), y = c(BY[j], BY[j]))
}
abline(h = 0, lwd = 1)
abline(v = 0, lwd = 1)

Representation of data using {MendelianRandomization} package

The function mr_plot for the graphical display of summarized data can be used in two main ways.

  • First, it can be applied to an MRInput object from the mr_input function.

For instance, let’s use MR_LDLObject or MR_HDLObject :

## plotly
mr_plot(MR_LDLObject)
mr_plot(MR_LDLObject, orientate = TRUE, line = "egger")
mr_plot(MR_LDLObject, interactive = FALSE)

mr_plot(MR_LDLObject, interactive = FALSE, labels = TRUE)

mr_plot(MR_LDLObject, interactive = TRUE, labels = TRUE)
  • Secondly, it can be applied to an MRAll object from the mr_allmethods function:
mr_plot(mr_allmethods(MR_LDLObject))

mr_plot(mr_allmethods(MR_HDLObject))

mr_plot(mr_allmethods(MR_HDLObject, method = "main"))

mr_plot(mr_allmethods(MR_HDLObject, method = "egger"))

As previously, please read through the help documentation (for example, type ?mr_plot), and explore the various functions.

Representation of data using {TwoSampleMR} package

See MR-Base website examples.

{TwoSampleMR} can also produce various plots for the data :

  • scatterplot
  • single snp analysis plot
  • leave-one-out plot
  • funnel plot
library(TwoSampleMR)
# scatter plot
mr_results <- mr(dat, method_list = c("mr_ivw_mre", "mr_ivw_fe"))
p1 <- mr_scatter_plot(mr_results, dat) # this creates a scattergraph of the results
p1[[1]]

# single snp plot
res_single <- mr_singlesnp(dat)
p2 <- mr_forest_plot(res_single)
p2[[1]]

# leave one out plot
res_loo <- mr_leaveoneout(dat)
p3 <- mr_leaveoneout_plot(res_loo)
p3[[1]]

# funnel plot # can diag bias in meta-analysis
# visually detect publication bias (selective outcome reporting)
res_single <- mr_singlesnp(dat)
p4 <- mr_funnel_plot(res_single)
p4[[1]]

# Note if you need to change back to the MendelianRandomization package
detach("package:TwoSampleMR", unload = TRUE)

Take-home Message

The final paragraph of the vignette on {MendelianRandomization} says:

It is important to remember that the difficult part of a Mendelian randomization analysis is not the computational method, but deciding what should go into the analysis: which risk factor, which outcome, and (most importantly) which genetic variants. Hopefully, the availability of these tools will enable less attention to be paid to the mechanics of the analysis, and more attention to these choices. The note of caution is that tools that make Mendelian randomization simple to perform run the risk of encouraging large numbers of speculative analyses to be performed in an unprincipled way. It is important that Mendelian randomization is not performed in a way that avoids critical thought.

Hackathon

Steps

  • Choose risk factor and outcome
  • Find variants
  • Find associations of variants with risk factor and outcome
  • Validate variants as instrumental variables*
  • Calculate Mendelian randomization estimate
  • Perform sensitivity and supplementary analyses

(* This is an important step, but one that you may want to omit here due to time)

1. Choose risk factor and outcome

I chose body mass index and cigarette smoking: are increases in BMI associated with increases in cigarette smoking in a causal sense?

N.B.: In the package {TwoSampleMR} there is a function available_outcomes().

2. Find variants

I found a large GWAS investigation of BMI with 32 significant genetic variants: “Association analyses of 249,796 individuals reveal 18 new loci associated with body mass index” by Speliotes et al. I got the list of SNPs into R by pasting Table 1 and using a grep command.

bmi_snps <- c("rs1558902", "rs2867125", "rs571312", "rs10938397", "rs10767664", "rs2815752", "rs7359397", "rs9816226", "rs3817334", "rs29941", "rs543874", "rs987237", "rs7138803", "rs10150332", "rs713586", "rs12444979", "rs2241423", "rs2287019", "rs1514175", "rs13107325", "rs2112347", "rs10968576", "rs3810291", "rs887912", "rs13078807", "rs11847697", "rs2890652", "rs1555543", "rs4771122", "rs4836133", "rs4929949", "rs206936")
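The pasting-and-grep step can be sketched in base R. This is a hedged illustration: `tab_text` is a hypothetical stand-in for a few lines pasted from Table 1, not the actual table.

```r
# Hypothetical example lines pasted from a GWAS table
tab_text <- c("rs1558902 FTO 16 0.41",
              "rs2867125 TMEM18 2 0.83")

# Extract the first rsID found on each line with a regular expression
rsids <- regmatches(tab_text, regexpr("rs[0-9]+", tab_text))
rsids
```

The same pattern (`"rs[0-9]+"`) works with grep(value = TRUE) if each pasted token is already a separate element.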

3. Find associations of variants with risk factor and outcome

I used PhenoScanner to obtain genetic associations with BMI from GIANT and with cigarettes smoked per day from TAG (both are large GWAS consortia).

# phenoscanner() ## proxies = "None"

mr_obj <- pheno_input(
  snps = bmi_snps,
  exposure = "Body mass index",
  pmidE = "25673413",
  ancestryE = "European",
  outcome = "Cigarettes per day",
  pmidO = "20418890",
  ancestryO = "European"
)
## PhenoScanner V2
## Cardiovascular Epidemiology Unit
## University of Cambridge
## Email: phenoscanner@gmail.com
## 
## Information: Each user can query a maximum of 10,000 SNPs (in batches of 100), 1,000 genes (in batches of 10) or 1,000 regions (in batches of 10) per hour. For large batch queries, please ask to download the data from www.phenoscanner.medschl.cam.ac.uk/data.
## Terms of use: Please refer to the terms of use when using PhenoScanner V2 (www.phenoscanner.medschl.cam.ac.uk/about). If you use the results from PhenoScanner in a publication or presentation, please cite all of the relevant references of the data used and the PhenoScanner publications: www.phenoscanner.medschl.cam.ac.uk/about/#citation.
## 
## [1] "1 -- chunk of 10 SNPs queried"
## [1] "2 -- chunk of 10 SNPs queried"
## [1] "3 -- chunk of 10 SNPs queried"
## [1] "4 -- chunk of 10 SNPs queried"
# mr_obj ## show the whole table
slotNames(mr_obj)
##  [1] "betaX"         "betaY"         "betaXse"       "betaYse"      
##  [5] "exposure"      "outcome"       "snps"          "effect_allele"
##  [9] "other_allele"  "eaf"           "correlation"
# mr_obj@betaX
# mr_obj@betaY # ...

Other team examples

# Other example : with Phenoscanner website
# risk factor : cholesterol (total cholesterol, ldl or hdl)
# outcome : gallstones - gallbladder stones (calculs biliaires)

# library(MendelianRandomization)
library(tidyverse)

data_cholesterol <- readr::read_tsv(file = "./Practicals sessions/Practical5 Hackathon/cholesterol_PhenoScanner_GWAS.tsv")

data_gall <- readr::read_tsv(file = "./Practicals sessions/Practical5 Hackathon/gallstones_PhenoScanner_GWAS.tsv")

my_snp <- intersect(data_cholesterol$rsid, data_gall$rsid)

cat("my IV has", length(my_snp), "variants\n")

# head(data_cholesterol[data_cholesterol$rsid %in% my_snp, ]) %>%
#   as.data.frame()
# data_gall[data_gall$rsid %in% my_snp, ] %>%
#   as.data.frame()

my_obj <- pheno_input(
  snps = my_snp,
  exposure = "cholesterol",
  outcome = "gallstones",
  pmidE = "24097068",
  ancestryE = "European",
  pmidO = "17632509",
  ancestryO = "Mixed"
)
Questions

What if for pmidO we had selected a study with ancestryO = "Mixed" (or anything other than European)? Would it definitely be a mistake?
Or would it be acceptable if we had checked that the estimates (Bx and By) provided in studies “25673413” and “20418890” were properly adjusted for ethnicity (e.g. with 5 principal components)? In that case, is it OK?

SB Answer: Studies with a “Mixed” population are often more than 90% European, so in practice it is usually not a big “mistake”, even if strictly speaking it is one. It is best to have the two samples be as similar as possible, but even under the label “European” there can be a lot of heterogeneity.

4. Validate variants as instrumental variables

I didn’t perform this step in this case, but one would want to for making reliable inferences.
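One common validation check is instrument strength, often summarised by a per-variant F-statistic (a rough rule of thumb is F > 10). A minimal sketch, using hypothetical SNP–exposure estimates and standard errors (not values from this analysis):

```r
# Approximate per-variant F-statistic: F_j ~ (beta_Xj / se_Xj)^2
beta_x <- c(0.08, 0.05, 0.10)   # hypothetical SNP-exposure estimates
se_x   <- c(0.01, 0.02, 0.015)  # hypothetical standard errors

f_stat <- (beta_x / se_x)^2
weak   <- f_stat < 10            # flag potentially weak instruments
```

Variants flagged as weak could be dropped or handled with weak-instrument-robust methods.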

5. Calculate Mendelian randomization estimate

I calculated MR estimates in the {MendelianRandomization} package

mr_ivw(mr_obj)
## 
## Inverse-variance weighted method
## (variants uncorrelated, random-effect model)
## 
## Number of Variants : 31 
## 
## ------------------------------------------------------------------
##  Method Estimate Std Error 95% CI       p-value
##     IVW    1.465     0.490 0.504, 2.426   0.003
## ------------------------------------------------------------------
## Residual standard error =  0.955 
## Residual standard error is set to 1 in calculation of confidence interval when its estimate is less than 1.
## Heterogeneity test statistic (Cochran's Q) = 27.3475 on 30 degrees of freedom, (p-value = 0.6050). I^2 = 0.0%.
mr_plot(mr_obj, orientate = TRUE)
mr_allmethods(mr_obj, method = "main")
##           Method Estimate Std Error 95% CI        P-value
##    Simple median    1.534     0.729   0.105 2.963   0.035
##  Weighted median    1.526     0.704   0.147 2.906   0.030
##              IVW    1.465     0.490   0.504 2.426   0.003
##         MR-Egger    2.808     1.308   0.243 5.372   0.032
##      (intercept)   -0.051     0.046  -0.142 0.039   0.268
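The IVW estimate above is numerically equivalent to a weighted linear regression of the SNP–outcome associations on the SNP–exposure associations through the origin, weighted by the inverse variance of the outcome associations. A minimal sketch with toy (hypothetical) values:

```r
# IVW as weighted regression through the origin (toy, hypothetical values)
beta_x <- c(0.08, 0.05, 0.10)  # SNP-exposure associations
beta_y <- c(0.12, 0.06, 0.15)  # SNP-outcome associations
se_y   <- c(0.03, 0.04, 0.05)  # standard errors of beta_y

ivw_fit <- lm(beta_y ~ beta_x - 1, weights = se_y^-2)
ivw_est <- coef(ivw_fit)[["beta_x"]]
```

The closed form is sum(w * beta_x * beta_y) / sum(w * beta_x^2) with w = se_y^-2; mr_ivw() additionally handles fixed/random-effect standard errors and correlated variants.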

6. Perform sensitivity and supplementary analyses

I didn’t perform additional analyses in this case, but several interesting supplementary analyses could be performed.


Suppl.

Questions

  • Assumption of a linear effect of X on Y: what if this is violated? Investigating non-linear effects requires access to individual-level data. A nice study looking at non-linear effects: https://www.bmj.com/content/364/bmj.l1042

  • One cohort but two data sources: this is a one-sample analysis (not two-sample).

  • Explanation of multivariable IVW: the example is equivalent to summary(lm(chdlodds ~ x1 + x2 + x3 - 1, weights = chdloddsse^-2)),
    except that lm uses the t-distribution, whereas the multivariable IVW implementation assumes normality, so the normal distribution is used.

  • Do you have a simple program to estimate a genetic risk score?
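A genetic risk score is simply a (weighted) sum of risk-allele counts per individual. A minimal base-R sketch, using a made-up genotype matrix G (individuals x SNPs, coded 0/1/2) and hypothetical per-SNP weights w:

```r
# Toy genotype matrix: 5 individuals x 3 SNPs, allele counts 0/1/2
set.seed(42)
G <- matrix(rbinom(5 * 3, size = 2, prob = 0.3), nrow = 5, ncol = 3)
w <- c(0.12, -0.05, 0.30)  # hypothetical per-SNP effect sizes

grs_unweighted <- rowSums(G)          # simple risk-allele count
grs_weighted   <- as.vector(G %*% w)  # effect-size weighted score
```

The weighted score is the usual choice when external effect-size estimates are available; the unweighted count is a fallback when they are not.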

Info Session

sessionInfo()
## R version 4.1.3 (2022-03-10)
## Platform: x86_64-pc-linux-gnu (64-bit)
## Running under: Debian GNU/Linux 11 (bullseye)
## 
## Matrix products: default
## BLAS:   /usr/lib/x86_64-linux-gnu/openblas-pthread/libblas.so.3
## LAPACK: /usr/lib/x86_64-linux-gnu/openblas-pthread/libopenblasp-r0.3.13.so
## 
## locale:
##  [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
##  [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8    
##  [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
##  [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                 
##  [9] LC_ADDRESS=C               LC_TELEPHONE=C            
## [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       
## 
## attached base packages:
## [1] stats     graphics  grDevices datasets  utils     methods   base     
## 
## other attached packages:
##  [1] MendelianRandomization_0.7.0 meta_6.2-1                  
##  [3] ivpack_1.2                   AER_1.2-10                  
##  [5] survival_3.2-13              sandwich_3.0-2              
##  [7] lmtest_0.9-40                zoo_1.8-12                  
##  [9] car_3.1-2                    carData_3.0-5               
## [11] rmarkdown_2.8                targets_0.4.2               
## 
## loaded via a namespace (and not attached):
##  [1] nlme_3.1-155        httr_1.4.7          numDeriv_2016.8-1.1
##  [4] tools_4.1.3         bslib_0.2.5         utf8_1.2.1         
##  [7] R6_2.5.0            metafor_4.2-0       lazyeval_0.2.2     
## [10] colorspace_2.0-1    withr_2.5.0         tidyselect_1.2.0   
## [13] processx_3.5.2      arrangements_1.1.9  compiler_4.1.3     
## [16] glmnet_4.1-8        cli_3.4.1           quantreg_5.95      
## [19] SparseM_1.81        xml2_1.3.2          plotly_4.10.1      
## [22] labeling_0.4.2      sass_0.4.0          scales_1.2.1       
## [25] DEoptimR_1.1-3      robustbase_0.99-0   callr_3.7.0        
## [28] stringr_1.5.0       digest_0.6.27       minqa_1.2.4        
## [31] pkgconfig_2.0.3     htmltools_0.5.6.1   lme4_1.1-33        
## [34] highr_0.9           fastmap_1.1.0       htmlwidgets_1.6.2  
## [37] rlang_1.1.1         farver_2.1.0        shape_1.4.6        
## [40] jquerylib_0.1.4     generics_0.1.0      jsonlite_1.7.2     
## [43] crosstalk_1.2.0     dplyr_1.1.2         magrittr_2.0.3     
## [46] Formula_1.2-5       metadat_1.2-0       Matrix_1.5-4.1     
## [49] Rcpp_1.0.6          munsell_0.5.0       fansi_0.4.2        
## [52] abind_1.4-5         lifecycle_1.0.3     stringi_1.7.12     
## [55] yaml_2.2.1          CompQuadForm_1.4.3  mathjaxr_1.6-0     
## [58] MASS_7.3-60         grid_4.1.3          lattice_0.21-8     
## [61] splines_4.1.3       knitr_1.33          ps_1.6.0           
## [64] pillar_1.9.0        igraph_1.2.6        boot_1.3-28        
## [67] rjson_0.2.20        codetools_0.2-18    iterpc_0.4.2       
## [70] glue_1.6.2          evaluate_0.14       data.table_1.14.0  
## [73] renv_0.16.0         BiocManager_1.30.15 vctrs_0.6.2        
## [76] nloptr_1.2.2.2      foreach_1.5.1       MatrixModels_0.5-1 
## [79] gtable_0.3.0        purrr_1.0.1         tidyr_1.3.0        
## [82] ggplot2_3.4.3       xfun_0.23           viridisLite_0.4.0  
## [85] tibble_3.2.1        iterators_1.0.13    gmp_0.7-2          
## [88] ellipsis_0.3.2